File System Implementation¶

File system resides on secondary storage

Layered File System¶

Logical file system
Keep all the meta-data necessary for the file system
It stores the directory structure
It stores a data structure that stores the file description (File Control Block - FCB)
Input from above:
- Open/Read/Write filepath
Output to below:
- Read/Write logical blocks
File-organization module
Knows about logical file blocks (from 0 to N) and corresponding physical file blocks: it performs translation

把逻辑块映射到物理块。输入是逻辑块号，输出是物理块号。
It also manages free space
Basic file system
Allocates/maintains various buffers that contain file-system, directory, and data blocks.

提供 buffer，用于缓存文件系统、目录和数据块。在 Linux 中称为 IO buffer。
I/O Control

Device drivers and interrupt handlers.

I/O control 将上层的指令转换为 low-level, hardware-specific 的指令来实现相关操作。同时也可以发中断。

分层是为了降低复杂度，通过接口来隔离不同层。但也降低了性能。

File System Data Structures¶

on-disk 的是可持久化的（persisitant），in-memory 的是易失的（volatile）。

On-disk structures
An optional boot control block
A volume control block
A directory
A per-file File Control Block (FCB)
In-memory structures
A mount table with one entry per mounted volume
A directory cache for fast path translation (performance)
A global open-file table
A per-process open-file table
Various buffers holding disk blocks “in transit” (performance)

File Control Block¶

ext2_inode：前面是metadata，后面指针指向data，overhead很大

File Creation:

Logical file system allocates a new FCB
Appropriate directory is updated with the new file name and FCB

Operations - open() and close()

系统调用 open() 将文件名传给 logical file system，后者搜索 system-wide open-file table以确定该文件是否正在被其他进程使用。
如果有，则直接在当前进程的 per-process open-file table 中新建一个 entry，指向 system-wide open-file table 中的对应项即可。
否则，需要在 directory 中找到这个 file name，并将其 FCB 从磁盘加载到内存中，并将其放在 system-wide open-file table 中。然后，在当前进程的 per-process open-file table 中新建一个 entry，指向 system-wide open-file table 中的对应项

图(b)中的index就是我们read/write传入的参数fd
同一个process的不同thread之间共享
system-wide open-file table中每一条都有一个open count记录开了几次
close一个时，从per-process open-file table移去对应entry，减少一个open count
open count清零时从system-wide open-file table移去
在 Unix 里面（UFS）System-Wide Open-File Table 会放设备、网络，所以我们的设备也是用文件来表示的，读写文件相当于读写设备。
inode numbering system is only unique within a given file system

Virtual File Systems¶

操作系统可以handle各种类型的文件系统

Linux通过定义统一的接口并在下层实例化为具体操作来实现VFS

VFS provides an object-oriented way of implementing file systems

操作系统为所有文件系统设置了一套统一的interface，所有的system call都基于这套interface实现

Linux defines four VFS object types:

superblock: defines the file system type, size, status, and other metadata
inode: contains metadata about a file (location, access mode, owners…)
dentry: associates names to inodes, and the directory layout
file: actual data of the file

VFS的大多数interface都是指针，指向对应文件系统真正实现的部分

Write syscall -> vfs_write -> indirect call -> ext4_file_write_iter
when is file->f_op setted?
- 在我们open这个文件的时候，file->f_op 就被设为了对应的函数表的地址（f_op 是指针）。

Directory Implementation¶

Directory is a special file, storing the mapping from file name to inode.

他的数据块有自己的名字（目录项 dir_entry），每一个目录项有一个 inode号、目录项长度、文件名长度和文件名。

d_entry从人能读懂的文件名转换成计算机能识别的

目录的data block中装的是路径

一般4byte对齐

目录项长度：方便查找，不符合直接跳过，用空间换时间

为hard link

最简单的实现方式是 linear list，即维护 dir_entry[]。这种方案的缺点是查找文件很费时。

使用有序数据结构（有序表、平衡树、B+ 树等）能够优化这一问题。

使用 hash table 也可以解决这一问题。

创建一个文件：首先找到当前目录的 inode，在其指向的数据块里加上一个目录项。（在之前要先分配一个 inode 随后才能放入目录项）

Disk Block Allocation¶

Files need to be allocated with disk blocks to store data

Contiguous Allocation¶

Each file is in a set of contiguous blocks

Good because sequential access causes little disk head movement, and thus shorten seek times

缺点是会碎片化(external fragmentation)，同时文件如果增大需要重新分配空间

Linked Allocation¶

Each file is a linked list of disk blocks

Blocks may be scattered anywhere on the disk (no external fragmentation, no compaction)

缺点：搜索很慢，I/O很高，pointer很浪费空间，reliability不高，如果一个pointer被corrupt，当前及其往后的都找不到了

改进: cluster the blocks - like 4 blocks

FAT (File Allocation Table) uses linked allocation DOS

Indexed Allocation¶

Each file has its own index blocks of pointers to its data blocks

用一个块只做 index，里面存放指向数据块的指针。

Index table provides random/direct access to file data blocks
No external fragmentation, but overhead of index blocks
Allows holes in the file
Index block needs space - waste for small files

需要一个方法分配 index block 的大小（太大会浪费，太小那么指向的空间小）。我们可以把 index block 链接起来，或者用多级索引

如果 block size 为 4KB，那么 Linux 中能创建的最大文件大小为 4TB+4GB+4MB+48KB。如果我们有一个 10KB 的文件，那么只需要前 3 个 direct pointer 就可以，后面的指针都是 NULL，不需要展开。

Free-Space Management¶

Bitmap¶

Use one bit for each block, track its allocation status

相对容易找到连续的block 需要额外的空间

Linked Free Space¶

Keep free blocks in linked list

No waste of space, just use the memory in the free block for pointers

找到一个空的很方便

Cannot get contiguous space easily
Allocating multiple free blocks require traverse the list
Usually no need to traverse the entire list: return the first one

Grouping and Counting¶

Grouping: use indexes to group free blocks
Store address of n-1 free blocks in the first free block, plus a pointer to the next index block
Allocating multiple free blocks does not need to traverse the list
Counting: a link of clusters (starting block + # of contiguous blocks)
Space is frequently contiguously used and freed
In link node, keep address of first free block and # of following free blocks

File System Performance¶

To improve file system performance:

Keeping data and metadata close together
Use cache: separate section of main memory for frequently used blocks
Use asynchronous writes, it can be buffered/cached, thus faster
Free-behind and read-ahead: techniques to optimize sequential access - remove the previous page from the buffer, read multiple pages ahead

Page Cache¶

A page cache caches pages for MMIO, such as memory mapped files

A unified buffer cache uses the same page cache to cache both memory-mapped pages and disk I/O to avoid double caching

File and Directory in Practice¶

Two Key Abstractions
File
- linear array of bytes
- has a low-level name - inode number
- 通常OS不知道每个文件的具体类型
Directory
- 包含一系列low-level的用户可读的名字
- 每个entry指向对应文件或其他目录
fd - fd=0 stdin - fd=1 stdout - fd=2 stderr - 因此用户能用的从fd=3开始
when removing files, using unlinkat, unlink(AT_FDCWD, "tmp", 0) for example
Link - A file may be known by more than one name in one or more directories. - Such multiple names are known as links. - Two kinds of links are also known as hard links and soft links. - Hard link 不能跨文件系统
- A hard link is a directory entry that associates with a file
- The file name “." in a directory is a hard link to the directory itself
- The file name ".." is a hard link to the parent directory
- 以上两个文件对任何都有，因此创造一个目录就会有这两项
- 同时如果父文件夹还存在其他子文件夹，那么这些子文件夹的..也会对这个文件夹产生hard link
- hard link时link后的两个文件inode一样
- 删除最后一个hard link，对应的文件也就从磁盘中删除了
- ln命令
- Soft link (a.k.a., Symbolic link or symlink) 可以跨文件系统
- A symbolic link is a file containing the path name of another file
- Soft links, unlike hard links, may point to directories and may cross file-system boundaries
- ln -s f1 f2，将f1 link到f2，从而f2指向f1
- hard link更经济，现在用的更多

进程：OS进行资源分配调度的基本单元

线程：执行单元

An Example of FS Organization¶

suppose 64 blocks, size 4KB
需要存的有data, inode, bitmap(一个inode bitmap，一个data region bitmap), 以及一个superblock
56个用来存data
5个用来存inode。假设inode是256bytes，一个block可以存16个，5个一共可以支持80个inode，因此最多存80个file或directory

df -i指令查看可用和当前使用inode数量

Example¶

第一个read找到root的data block，之后遍历directory entry，找到foo的，之后read foo inode以找到data block，遍历directory entry，找到后read bar的inode
之后read，write的时候更新access time

create的时候和上文类似，遍历foo的目录项找不到bar，因此read inode bitmap，发现有可以写的地方后write，之后在foo中写入bar对应的目录项，并初始化bar的inode，最后更新foo(即父目录)的inode的metadata
write的时候首先read bar的inode，之后read data bitmap查看是否可以写，并写入，再写入bar的data block